Image-to-Image Translation with Conditional Adversarial Networks - summary

Author: Nikolay Chehlarov date: 04.02.2022

paper: Image-to-Image Translation with Conditional Adversarial Networks

authors: Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros; Berkeley AI Research (BAIR) Laboratory, UC Berkeley

Link to paper: https://arxiv.org/abs/1611.07004
Official implementation: https://github.com/phillipi/pix2pix.

Abstract

The aim of the model is to translate an image from one domain to another. Given an input image in one modality, an output in a different modality is generated. Training is done on pairs of images in two different modalities. Example pairs include satellite image - Google Maps image, day - night, sketch - bag, semantic segmentation mask - real image, and many more. For example, one can provide a semantic segmentation mask and obtain a realistic picture.

An online demo is available at https://affinelayer.com/pixsrv/ for testing the approach.

Traditionally, these tasks have been tackled with separate, special-purpose machinery, despite the fact that the setting is always the same: predict pixels from pixels. The goal of this paper is to develop a common framework for all these problems.

fig.1

Highlights

Model architecture

pix2pix is based on the conditional GAN (cGAN) architecture. The main building blocks of pix2pix are a U-Net-based generator and a convolutional PatchGAN discriminator.

Unlike a typical cGAN, the generator is trained with a combination of two losses: one from the discriminator (the GAN loss, binary cross-entropy) and one measuring the difference between the generated output and the target image. The second loss is the mean absolute error, referred to as the L1 loss. In the paper the combined loss is a weighted sum, with the L1 term having a weight of 100. Unlike an unconditional GAN, both the generator and the discriminator observe the input image.

The objective of a conditional GAN can be expressed as $$ \mathscr{L}_{cGAN}(G, D)=E_{x, y}[\log D(x, y)] + E_{x, z}[\log(1 - D(x, G(x, z)))]$$ where G tries to minimize this objective against an adversarial D that tries to maximize it.
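The paper adds an L1 reconstruction term to this adversarial loss, giving the final objective used for training:

$$ \mathscr{L}_{L1}(G) = E_{x, y, z}\big[\lVert y - G(x, z) \rVert_1\big] $$

$$ G^* = \arg\min_G \max_D \, \mathscr{L}_{cGAN}(G, D) + \lambda\, \mathscr{L}_{L1}(G) $$

with $\lambda = 100$, matching the L1 weight mentioned above.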

fig.2

In a conditional GAN the loss is learned, and can, in theory, penalize any possible structure that differs between output and target. Conditional GANs learn a mapping from an observed image x and a random noise vector z to an output image y.
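As a minimal illustration (not the original implementation), the combined generator loss can be sketched in NumPy; the function and array names here are hypothetical:

```python
import numpy as np

LAMBDA = 100.0  # L1 weight used in the paper


def generator_loss(d_pred, g_out, target, eps=1e-7):
    """Combined generator loss: GAN loss (BCE vs. 'real' labels) + lambda * L1.

    d_pred : discriminator probabilities for the generated pair, in (0, 1)
    g_out  : generator output image
    target : ground-truth target image
    """
    # Binary cross-entropy against all-ones labels:
    # the generator wants the discriminator to output "real"
    gan_loss = -np.mean(np.log(d_pred + eps))
    # Mean absolute error between generated output and target (L1 loss)
    l1_loss = np.mean(np.abs(target - g_out))
    return gan_loss + LAMBDA * l1_loss


# Example: D is unsure (0.5), generator output is off by 0.5 everywhere
loss = generator_loss(np.array([0.5]), np.zeros((2, 2)), np.full((2, 2), 0.5))
```

Note how the L1 term dominates the total with this weighting, which is what pushes the output to stay close to the ground truth while the GAN term sharpens it.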

Key details from the paper

Implementation

The implementation originates from https://github.com/bnsreenu/python_for_microscopists/tree/master/251_satellite_image_to_maps_translation with minimal modifications. That source is based on code by Jason Brownlee from his blog at https://machinelearningmastery.com/.

The model is trained to generate maps from satellite images.

Generator:
Input - source image; Output - target image

The encoder-decoder architecture consists of:

encoder: C64-C128-C256-C512-C512-C512-C512-C512

decoder: CD512-CD512-CD512-C512-C256-C128-C64
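Assuming 256×256 inputs and 4×4 stride-2 (transposed) convolutions throughout, as described in the paper, the spatial sizes of the blocks above can be traced in plain Python:

```python
# Channel counts from the architecture strings above.
encoder = [64, 128, 256, 512, 512, 512, 512, 512]   # C64-C128-...-C512
decoder = [512, 512, 512, 512, 256, 128, 64]        # CD512-...-C64

size = 256  # assumed input resolution
trace = []

for channels in encoder:
    size //= 2               # each stride-2 convolution halves the spatial size
    trace.append((size, channels))
bottleneck = size            # 1x1 at the U-Net bottleneck

for channels in decoder:
    size *= 2                # each stride-2 transposed convolution doubles it
    trace.append((size, channels))
size *= 2                    # final transposed convolution restores 256x256
```

The trace shows why eight encoder blocks are needed: only then does a 256×256 input collapse to a 1×1 bottleneck, after which the decoder (plus the output layer) mirrors the path back up to 256×256.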

Discriminator:

Input - pair of source and target image; output - probability between 0 and 1 that the pair is real, i.e. that the target image is a genuine counterpart of the source

C64-C128-C256-C512

After the last layer, a convolution is applied to map to a 1-dimensional output, followed by a Sigmoid function.
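This stack is the 70×70 PatchGAN from the paper: each output unit judges a 70×70 patch of the input. The receptive field can be checked by walking backwards through the layers, assuming 4×4 kernels with stride 2 for the first three blocks and stride 1 for the last C512 block and the final 1-channel convolution (as in the official implementation):

```python
# (kernel, stride) per layer: C64, C128, C256 use stride 2;
# C512 and the final 1-channel convolution use stride 1.
layers = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

# Walk backwards from a single output unit to the input pixels it sees.
rf = 1
for kernel, stride in reversed(layers):
    rf = rf * stride + (kernel - stride)
print(rf)  # -> 70
```

Because the discriminator only scores local patches, it can be applied convolutionally to images of any size, and its output is averaged to give the final real/fake score.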

Function definition

Loading data

Data from: http://efrosgans.eecs.berkeley.edu/pix2pix/datasets/maps.tar.gz
Other datasets can be found here: http://efrosgans.eecs.berkeley.edu/pix2pix/datasets/
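The pix2pix datasets store each source/target pair side by side in a single image file, so loading typically involves splitting each file in half. A sketch with NumPy arrays (the raw maps images are assumed to be 600×1200 combined pairs; the helper name is hypothetical):

```python
import numpy as np


def split_pair(pair):
    """Split a side-by-side (satellite | map) image into (source, target).

    pair : array of shape (H, 2*W, C) holding both halves.
    """
    width = pair.shape[1] // 2
    return pair[:, :width], pair[:, width:]


# Dummy 600x1200 RGB array standing in for one file from maps.tar.gz
pair = np.zeros((600, 1200, 3), dtype=np.uint8)
satellite, target_map = split_pair(pair)
```

In the actual pipeline the halves are then resized to 256×256 and scaled to [-1, 1] before being fed to the generator and discriminator.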

Defining and training the model

Loading a saved model

The trained model after 20 epochs can be downloaded from here: https://drive.google.com/file/d/1LvZ5izBhOUAh1qDHDHUK-ZlV703kwzmi/view?usp=sharing

Test the model on a few images

Discussion

Looking at the evolution of the model during training, improvement is slow after the first 10 epochs. The training is short (only 20 epochs) but already delivers promising results. A practical improvement might be to use several satellite images of the same location, taken at different seasons, camera angles, or lighting directions.
The pix2pix model translates images from one domain to another. It was demonstrated how to train the model and convert satellite images to the corresponding map images.

References